server : speculative checkpointing #19493
Conversation
ggerganov
left a comment
I think this is good as a prototype, but we must find a way to encapsulate this logic in common/speculative. We should keep the server clean of extra speculative-related logic so that it is easier to maintain and to introduce new speculative approaches later on.
> Qwen3-Coder for auto-complete

I also use this model for auto-completion. Which IDE/client do you use?
For llama.cpp I use […]
I added a […]. I would like to run some tests and make a few minor edits.
Sample arguments for the […]:

```
--spec-type ngram-mod --draft-max 48 --spec-use-checkpoints on --ctx-checkpoints 12
# or
--spec-type ngram-map-k --draft-max 48 --spec-use-checkpoints on
```

Test result for 'quicksort' with Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf: […] I'm currently running additional tests and investigating the […]. Drafts with […]
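For readers unfamiliar with the lookup-based drafters referenced above, here is a toy Python sketch of map-based n-gram drafting (my own illustration, not the actual common/speculative implementation; the helper names are made up):

```python
from collections import defaultdict

def build_ngram_map(tokens: list[int], n: int) -> dict[tuple[int, ...], list[int]]:
    """Map each n-gram of the context to the tokens observed right after it."""
    m: dict[tuple[int, ...], list[int]] = defaultdict(list)
    for i in range(len(tokens) - n):
        m[tuple(tokens[i:i + n])].append(tokens[i + n])
    return m

def draft_tokens(tokens: list[int], n: int, draft_max: int) -> list[int]:
    """Greedily extend the context by walking the n-gram map, like a
    lookup-based speculative drafter (no draft model involved)."""
    m = build_ngram_map(tokens, n)
    ctx = list(tokens)
    draft: list[int] = []
    while len(draft) < draft_max:
        key = tuple(ctx[-n:])
        candidates = m.get(key)
        if not candidates:
            break  # n-gram never seen in the context: stop drafting
        nxt = max(set(candidates), key=candidates.count)  # most frequent continuation
        draft.append(nxt)
        ctx.append(nxt)
    return draft
```

This is why very repetitive output (like the quicksort example) drafts so well: the same n-grams recur, so long drafts get accepted.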
I did some testing and debugging but did not completely get it. The log says: […]

Here are my build options:

```shell
cmake -B build \
    -DGGML_VULKAN=ON \
    -DGGML_USE_OPENMP=ON \
    -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_AVX_VNNI=ON -DGGML_AVX_BMI=ON -DGGML_FMA=ON \
    -DGGML_SSE42=ON -DGGML_F16C=ON \
    -DGGML_NATIVE=ON
```

And I enabled it for llama-cli for testing:

```shell
./build/bin/llama-cli -m ~/models/Qwen3.5-9B-UD-Q8_K_XL.gguf -md ~/models/Qwen3.5-0.8B-UD-Q8_K_XL.gguf --spec-type ngram-map-k --draft-max 48 --spec-use-checkpoints on --temp 0.0 -p "explain --spec-type ngram-map-k" -v -lv 4
```

Using:

```diff
@@ -3441,7 +3441,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
                 throw std::invalid_argument("unknown speculative decoding type without draft model");
             }
         }
-    ).set_examples({LLAMA_EXAMPLE_SERVER}));
+    ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
     add_opt(common_arg(
         {"--spec-ngram-size-n"}, "N",
         string_format("ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram (default: %d)", params.speculative.ngram_size_n),
```
I hope that helps. If there is something I could try to debug, let me know. |
@stsydow The current PR can't be used with a Qwen3.5 draft model. I'm trying to add checkpoints to the draft model (when using recurrent modules), but a draft model creates many more invalid drafts than […]
Testing Qwen3.5-27B with draft model Qwen3.5-0.8B looks promising (not yet in this PR), but there is a bug: the main model gets confused by the drafts. I am preparing a commit.
The previous commit added optional checkpoints to the draft-model implementation in common/speculative.cpp. quicksort-test results using […]: […] I'm still running additional tests.
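For context on what "invalid drafts" means here: under greedy sampling, the target model accepts a draft only as far as it matches what the target itself would have sampled. A hypothetical helper (my own sketch, not llama.cpp code) modeling that acceptance check:

```python
def count_accepted(draft: list[int], target_greedy: list[int]) -> int:
    """Length of the accepted prefix of a draft: how many draft tokens
    match the target model's own greedy picks. target_greedy[i] is the
    target's pick given the context plus draft[:i]."""
    n = 0
    for d, t in zip(draft, target_greedy):
        if d != t:
            break  # first mismatch: the rest of the draft is rejected
        n += 1
    return n
```

A partially accepted draft (n below the draft length) is what forces the checkpoint rollback for recurrent models.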
Review comment on:

```cpp
add_bos_token = llama_vocab_get_add_bos(vocab);
```

and on:

```cpp
if (params_base.speculative.has_dft()) {
    // TODO speculative: move to common/speculative.cpp?
```
Yes, we should move the draft model loading to common/speculative.cpp, but in a separate PR.
Thanks for your effort! I get some tg improvement on 27B/0.8B with this PR, but it was more an experiment, far from a real benchmark. I tried to rebase, but there is a conflict with 96cfc49, which looks like a fix for the problem I saw earlier.
@stsydow This PR has been rebased. I can reproduce a GGML_ASSERT when the server switches its slot in a chat.
**Test Report: PR #19493 on DGX Spark with Qwen3.5-35B-A3B**

Test Environment: […]

Test Results:

- ✅ Server starts successfully with […]
- ✅ Speculative decoding initializes correctly: […]
- ✅ Chat completion works: single requests process successfully at ~62 tokens/sec
- ✅ No crashes detected: […]

Key Finding: […]

Minor Observations: […]

Conclusion: […]

Tested by an automated agent on DGX Spark hardware
The reference […]
I did some more testing and am well above 30% speedup in token generation for code, and around 10% down for normal text where the drafts don't match well. The assert is also fixed, so LGTM.
Came here to say I'd help test to get this across the line. But then saw this. Let's goooooo
A parallel test, "generate quicksort in C/Java/Python", passes on the main branch (d417bc4) but fails on the feature branch (1f62966). Update: the model used in this test was Qwen3.5-27B-UD-Q5_K_XL, not Qwen3.5-122B-UD-Q8_K_XL. Update 2: the main branch can fail, too; the parallel test seems to introduce some randomness.
How do you test "PASS" vs "FAIL"?
The Python client writes the output_text into a file and compares the expected file with the actual content, so this test is at text level only, not at token level.

```python
from openai import OpenAI
[...]
response = client.responses.create(
    model=args.model,
    input=prompt
)
output_text = response.output_text
```

I rebased and will do these checks again, including #20288.
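The comparison step could be sketched like this (a hypothetical helper of my own, not the actual test script; it just reproduces the shape of the "C: MATCH (len 796)" lines in the logs below):

```python
def compare_against_baseline(baseline: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare per-language outputs against a stored baseline at text level.

    Returns one report line per language, mirroring the log format
    'C: MATCH (len 796)' used in the run_spec.py output.
    """
    lines = []
    for lang, expected in baseline.items():
        got = actual.get(lang, "")
        if got == expected:
            lines.append(f"{lang}: MATCH (len {len(got)})")
        else:
            lines.append(f"{lang}: MISMATCH (expected len {len(expected)}, got len {len(got)})")
    return lines
```

Because the check is a plain string comparison, any single diverging token in a greedy run shows up as a MISMATCH for the whole output.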
@srogmann One more thing - the prompt with Qwen 3.5 is 30 tokens long. I don't know what hardware you have, but for example on Apple Silicon (i.e. the Metal backend), we use different […]. I was able to work around this by using a larger prompt so that we always use the matrix kernel:

```diff
28c28
< prompt = f"Write a quicksort demo in {lang}, no comments."
---
> prompt = f"Write a quicksort demo in {lang}. Please, write just code. Do not write any extra comments."
```

Results:

```shell
# server
./bin/llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --port 8014 --reasoning off -np 4 --no-kv-unified --no-cache-prompt
# test
./run_spec.py --url http://127.0.0.1:8014/v1 --tests seq,seq,par
```

```
>>> Starting Test Sequence Step 1/3: SEQ
--- Running Mode: SEQ ---
Total duration for seq: 11.01s
ℹ First run detected. Storing results as baseline.
>>> Starting Test Sequence Step 2/3: SEQ
--- Running Mode: SEQ ---
Total duration for seq: 10.87s
--- Comparing against Baseline ---
C: MATCH (len 796)
Java: MATCH (len 960)
Python: MATCH (len 346)
>>> Starting Test Sequence Step 3/3: PAR
--- Running Mode: PAR ---
Total duration for par: 8.96s
--- Comparing against Baseline ---
C: MATCH (len 796)
Java: MATCH (len 960)
Python: MATCH (len 346)
ALL TESTS PASSED
```
**Testing ngram-mod checkpointing on a coding agent workload**

Setup: […]

I am running a home-grown coding agent benchmark, inspired by Aider Polyglot but using pi. It generates solutions to Exercism problems across multiple languages (Python, C++, JavaScript), then compiles and runs the test suite. The agent uses multi-turn tool calling (write file → run tests → iterate).

Results: 0/5 exercises passed before I interrupted; without ngram speculation I typically get 100%. Failure modes are all code quality: the structured output (tool calls) parses fine, but the generated code is subtly wrong: […]
@petter-b I saw a degradation in larger workloads, too. Therefore I wrote scripts as above to set up reproducible tests to analyze the point where differences in the output appear. |
@ggerganov I compiled llama.cpp from scratch using CPU (x86_64) only (instead of using CUDA as before). When I use one client, the results are equal. As soon as I start another client in another shell, the first client gets different results. I simplified the test client into the following one-liner. It would be interesting to know why the first client gets influenced by the other one.

Regarding this PR: when using one thread I can reproduce a case where a restored checkpoint gets a different sampling in this PR. I am digging into that.

```shell
python3 -c "import requests,json; [print(f'result: len={len(full)}, text=...{json.dumps(tail)}') for _ in range(20) for full in [requests.post('http://127.0.0.1:8080/v1/responses', json={'model': 'local', 'input': 'Write a quicksort demo in C. Please, write just code. Do not write any extra comments.'}).json()['output'][0]['content'][0]['text']] for tail in [full[-80:]]]"
```
It's still using the unified KV cache. You need to explicitly provide the flag:

```shell
llama-server ... -np 4 --no-kv-unified ...
```
@ggerganov I got feedback from @petter-b: I get reproducible results in CPU-only mode when I disable flash attention, too. With flash attention I had a difference in the logits after the first speculative batch. This helps to compare the results of normal operation and speculative decoding.
@srogmann Could you clarify? Is there still an issue you are looking into? If yes, how can it be reproduced? |
@ggerganov Yes, I am still investigating an issue when using a "quicksort prompt" (as above). I added logging of the logits:

```cpp
SLT_INF(slot, "slot decode token, id=%d, n_ctx = %d, n_tokens = %d, truncated = %d\n",
        slot.sampled,
        slot.n_ctx, slot.prompt.n_tokens(), slot.truncated);
if (slot.prompt.n_tokens() > 54) {
    int32_t n_vocab = llama_vocab_n_tokens(vocab);
    auto * mem_logits = llama_get_logits_ith(ctx, slot.i_batch);
    SLT_INF(slot, "last logits: [%.5f, %.5f, %.5f, ..., %.5f, %.5f, %.5f]\n",
            mem_logits[0], mem_logits[1], mem_logits[2],
            mem_logits[n_vocab - 3], mem_logits[n_vocab - 2], mem_logits[n_vocab - 1]);
}
```

After the first refused draft (because draft-max is too small) the logits change. Some tokens later the tokens differ. I use […].

Log without speculative decoding: […]

Log with speculative decoding: […]
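The same first/last-logits fingerprint can be computed in Python when diffing the two runs' logs (a hypothetical helper of my own, not part of the PR):

```python
def logits_fingerprint(logits: list[float]) -> str:
    """Format the first and last three logits like the SLT_INF log line,
    so runs with and without speculative decoding can be diffed line by line."""
    head = ", ".join(f"{v:.5f}" for v in logits[:3])
    tail = ", ".join(f"{v:.5f}" for v in logits[-3:])
    return f"last logits: [{head}, ..., {tail}]"
```

Two runs that agree up to a token should produce byte-identical fingerprint lines up to that token; the first differing line marks where the divergence starts.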
I made a comment here earlier and this is the feedback that @ggerganov is referring to. However, I removed it because I needed to double-check a few things. I have relied heavily on Claude Code to research, debug, test, code and even write this PR comment. A bit of a fun experiment on my side, and I am not going to submit any code here, as that is against the rules of the repo. Validation is through TDD: each bug has a failing test before the fix, verified on CUDA (RTX 3090) and to some extent Mac Metal (M4). Tests below are done on Qwen3.5-0.8B-BF16 and Qwen3.5-27B-UD-Q4_K_XL, CUDA RTX 3090. Code, tests, eval scripts, and raw results are available at petter-b/llama.cpp#1.

**Logit divergence after draft rejection**

@srogmann's finding (Mar 22) was reproduced: after a checkpoint restore following draft rejection, logits diverge. […]

One-line fix after […]:

```cpp
llama_memory_seq_rm(llama_get_memory(ctx_impl.ctx), slot_id, ckpt.pos_max + 1, -1);
```

For recurrent memory, […]. From srogmann's logs (Mar 18 comment), the divergence appears at token 264 after ~10 sequential identical requests, with logits shifting progressively across runs.

**Duplicate KV cells without seq_rm**

Runtime instrumentation was added to detect duplicate […]:

- 11,487 unique […]
- The […]

Sample log output: […]

**Throughput comparison**

W1: multi-turn iterative code generation (quicksort suite), Qwen3.5-0.8B-BF16, […]

With the […]. Without the fix, draft acceptance is substantially higher (3,957 accepted drafts on turn 4).

**Draft model speculation (0.8B → 9B and 0.8B → 27B)**

Draft model speculation was also tested (Qwen3.5-0.8B drafting for 9B and 27B targets). Draft model […]. W1 quicksort, diagnostic counters: […]

Rejection rate is ~45% on both targets; draft model acceptance does not scale with target model size.

**Reproduction**

Hardware: RTX 3090, CUDA. Model: […]. Server flags: […]. Request payload:

```json
{
  "model": "t",
  "messages": [{"role": "user", "content": "Write a quicksort demo in C. Please, write just code."}],
  "max_tokens": 500,
  "temperature": 0
}
```

Multi-turn W1 workload: 4 sequential chat requests accumulating message history: […]

To reproduce the duplicate KV cell count: add a scan after each […].
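The duplicate-cell scan can be modeled abstractly: treating occupied KV cells as (seq_id, pos) pairs, any pair held by more than one cell is a duplicate. A toy Python model of the counting logic (my own sketch, not the actual llama.cpp instrumentation):

```python
from collections import Counter

def count_duplicate_cells(cells: list[tuple[int, int]]) -> tuple[int, int]:
    """Given occupied KV cells as (seq_id, pos) pairs, return
    (unique_pairs, duplicate_count): every cell beyond the first
    holder of a (seq_id, pos) pair counts as a duplicate."""
    counts = Counter(cells)
    unique = len(counts)
    duplicates = sum(c - 1 for c in counts.values())
    return unique, duplicates
```

With a correct `seq_rm` after rollback, the duplicate count should stay at zero; a growing count indicates stale cells left behind by rejected drafts.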
The logits change because they had been computed when processing the draft; they belong to the rejected draft. When I use llama_state_seq_flags = 0 instead of […], […]. The changes are not yet in this PR.
This PR is a follow-up to #19270 (see #19267) to support the use of speculative decoding with recurrent modules using checkpoints. The use of checkpoints is not as fast as `llama_memory_seq_rm`, because in case of a partially accepted draft we have to go back to the checkpoint and execute a shorter batch. However, in use cases such as the quicksort example in #19164 we observe a large speedup (in this very repetitive case!), hence this PR.

This PR also contains a small fix of the `ngram-map-k` implementation.

Questions / open tasks:

- `ngram-map-k` uses the accept-feedback to shorten its drafts. I haven't looked into how to execute a batch without sampling (this would be fine when repeating a shorter draft without reusing the speculative implementation).
- […] make room).
- Are the `llama_state_seq` functions in this PR correct?
- Server log using Qwen3-Coder-Next, arguments `--spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16`, with quicksort prompts from #19164: […]
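The checkpoint mechanic described here can be illustrated with a toy model (my own sketch; `ToyRecurrentDecoder` and `speculate_step` are made up, not the common/speculative API). A recurrent state cannot be truncated in place the way a KV cache can with `seq_rm`, so a partially accepted draft forces a restore plus a shorter replay batch:

```python
import copy

class ToyRecurrentDecoder:
    """Toy stand-in for a recurrent model: its state is the full list of
    consumed tokens, and it cannot be truncated in place."""
    def __init__(self):
        self.state: list[int] = []

    def decode(self, tokens: list[int]) -> None:
        self.state.extend(tokens)

    def checkpoint(self) -> list[int]:
        return copy.deepcopy(self.state)

    def restore(self, ckpt: list[int]) -> None:
        self.state = copy.deepcopy(ckpt)

def speculate_step(model: ToyRecurrentDecoder, draft: list[int], n_accept: int) -> None:
    """Run a draft batch; if only a prefix is accepted, roll back to the
    checkpoint and re-execute the shorter (accepted) batch."""
    ckpt = model.checkpoint()    # save state before the speculative batch
    model.decode(draft)          # optimistic: decode the full draft
    if n_accept < len(draft):    # partial acceptance: rollback + replay
        model.restore(ckpt)
        model.decode(draft[:n_accept])
```

The replay of the shorter batch is exactly the extra cost mentioned above: with a KV cache, the rejected suffix could simply be removed instead.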
--spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16with quicksort prompts from #19164 :AI usage: Qwen3-Coder for auto-complete (common.h :-) ), some questions to MiniMax-M2.1.